Abstract

This project focused on two significant aspects of fault-tolerant computing:
fault-tolerant, testable VLSI system design and fault-tolerant multiprocessor design.
Research in the two areas was carried out concurrently. In the first area, a novel
concept for testable RAM design was developed that allows large RAMs to be designed
with built-in test capabilities. This testability feature is an integral part of the
design rather than an ad hoc addition, and it is the subject of a patent application
filed by the U.S. Air Force.

The second major focus of research was the development of fault-tolerant
multiprocessor topologies. It was demonstrated that DeBruijn multiprocessor networks
provide a robust, naturally fault-tolerant interconnection network. An attractive
feature of these networks is their ability to provide fault tolerance in a wide
variety of applications. Also developed was a new topology, termed Flip-Trees, which
provides certain optimal fault-tolerant properties. Finally, a practical perspective
on distributed agreement algorithms was formulated that can admit a large variety of
faults.
1  Introduction


This report summarizes the research carried out under AFOSR 87-0161. Three distinct
and major achievements occurred under this sponsorship. First, it was demonstrated that
the DeBruijn multiprocessor network provides an efficient fault-tolerant architecture that
can be useful for a wide variety of applications. This result is the thrust of a paper to
appear shortly in IEEE Transactions on Computers. Second, a new concept in the area
of RAM design was developed which allows for the design of large RAMs that are both
testable and defect tolerant. (A serendipitous additional result is that the proposed
design also achieves higher performance than the traditional design.) The design
is the basis of both a patent application and a pending publication.

Third, the problem of achieving consensus in a distributed system was explored. The
systems investigated were those in which two types of faults can occur: benign (omission
and timing faults) and malicious (exhibiting arbitrary behavior). A continuum was
established between the previously known results for systems with no malicious faults and
for systems in which at most one third of the nodes are faulty. Additional research was
conducted in the area of VLSI yield models.

The following section elaborates on the research summarized above. Section 3 lists
all resulting publications and all students supported. Section 4 briefly surveys future
research directions and plans.

2  Summary  of  Research  Results 

2.1  Research  in  Fault-Tolerant  Multiprocessor  Architectures 

The search for computationally efficient multiprocessor architectures that are suitable for
VLSI has spawned an increasingly important research area. Several parallel architectures
which solve a wide variety of problems have been proposed. These include the linear array
(1)*, the ring (1), the complete binary tree (CBT), the tree machine (TM), the
shuffle-exchange (SE), the cube-connected cycles (CCC), the two-dimensional mesh, the even
double-exchange, the orthogonal and the doubly-twisted torus.

Certain of the real-life problems that must be accommodated have been successfully
grouped into various classes. These classifications are important because statements
such as “all problems in this class have this complexity” convey much more than
statements like “this problem is of this complexity”. Included among these classifications
are the pipeline class, the multiplex class, the NP-complete class, the ASCEND and
DESCEND classes, as well as the decomposable searching class.

*The numbers in parentheses refer to publication numbers in Section 3.

First, problems in the pipeline class can be efficiently solved in a pipe (linear array).
Depending on the problem, data may flow in one direction or in both directions
simultaneously. Matrix-vector multiplication is a typical example of a problem that can
be solved by a one-way pipeline algorithm. Band matrix-vector multiplication, recurrence
evaluation and priority queues are representative of problems that can be solved by
two-way pipeline algorithms.

The multiplex class covers a range of problems characterized by [1]: (1) operation on
N data operands to produce a single result; and (2) evaluation that can be described by a
tree. The evaluation of general arithmetic expressions, polynomial evaluation, and similar
problems fall into this category. The natural computation graph for this paradigm is a
tree whose nodes correspond to operations and whose edges correspond to data flow between
operations. The CBT (complete binary tree) can be used to solve the problems in this class.
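
As an illustration of the multiplex paradigm, the following Python sketch combines N
operands level by level, the way a complete binary tree would, and counts the levels
used. The name `tree_reduce` and the sequential simulation of each tree level are
assumptions made for illustration; they are not taken from [1].

```python
import operator


def tree_reduce(operands, op):
    """Combine operands pairwise, one tree level at a time.

    Returns (result, levels). On a tree machine all pairs of one level
    would be combined in parallel; here each level is simulated in order.
    """
    level = list(operands)
    if not level:
        raise ValueError("at least one operand is required")
    levels = 0
    while len(level) > 1:
        # Combine neighbours (0,1), (2,3), ... of the current level.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            # An unpaired operand is passed up to the next level unchanged.
            nxt.append(level[-1])
        level = nxt
        levels += 1
    return level[0], levels


total, depth = tree_reduce(range(1, 9), operator.add)
print(total, depth)
```

For N operands the loop runs about log2(N) times, which is the depth of the
corresponding complete binary tree.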

Another important class of problems is the NP-complete class [24]. For this class, the
CBT can efficiently implement exhaustive search algorithms [1]. The time complexity,
however, remains exponential.

The ASCEND and DESCEND classes comprise highly parallel algorithms [15].
Here, the paradigm of the algorithms is the iterative rendition of a divide-and-conquer
scheme. The input and output are each a vector of $N$ ($= 2^k$) data items; the “divide”
step splits a problem into two subproblems of equal size, and the “marry” step combines
the results of the two subproblems by executing a single operation on corresponding pairs
of data items. That is, assume that the input data $D_0, D_1, \ldots, D_{N-1}$ are stored,
respectively, in storage locations $T[0], T[1], \ldots, T[N-1]$. An algorithm in the
DESCEND class performs a sequence of basic operations on pairs of data that are
successively $2^{k-1}, 2^{k-2}, \ldots, 2^1, 2^0$ locations apart. In terms of the above
divide-and-conquer model, the marry step involves pairs of data $2^0$ locations apart. On
the other hand, in the dual class (the ASCEND class), basic operations are performed on
data that are successively $2^0, 2^1, \ldots, 2^{k-1}$ locations apart; the marry step
involves pairs of data $2^{k-1}$ locations apart. Problems in these classes can be solved
on the SE and the CCC.
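
The access pattern of the two classes can be sketched in Python as follows. This is a
minimal sequential simulation, assuming the pair operation is passed in as a function;
the names `descend`, `ascend`, `compare_exchange` and `exchange_sum` are hypothetical.
The two pair operations shown (a bitonic merge step and an all-to-all sum) are standard
examples chosen for illustration, not necessarily the ones used in [1] or [15].

```python
def _sweep(data, op, exponents):
    """Apply op to every pair of locations 2**j apart, for each j in turn."""
    t = list(data)
    n = len(t)
    if n == 0 or n & (n - 1):
        raise ValueError("the number of items must be a power of two")
    for j in exponents:
        d = 1 << j
        for i in range(n):
            if i & d == 0:
                # i is the lower location of the pair (i, i + d). Pairs of one
                # step are disjoint, so a parallel machine could do them at once.
                t[i], t[i + d] = op(t[i], t[i + d])
    return t


def descend(data, op):
    """Distances 2**(k-1), ..., 2**1, 2**0, in that order."""
    k = len(data).bit_length() - 1
    return _sweep(data, op, range(k - 1, -1, -1))


def ascend(data, op):
    """Distances 2**0, 2**1, ..., 2**(k-1), in that order."""
    k = len(data).bit_length() - 1
    return _sweep(data, op, range(k))


def compare_exchange(a, b):
    """Pair operation of a bitonic merge: smaller item to the lower location."""
    return (a, b) if a <= b else (b, a)


def exchange_sum(a, b):
    """Pair operation that leaves the sum of the pair in both locations."""
    return (a + b, a + b)


print(descend([1, 4, 6, 7, 8, 5, 3, 2], compare_exchange))
print(ascend([1, 2, 3, 4], exchange_sum))
```

With `compare_exchange`, `descend` sorts a bitonic input sequence; with `exchange_sum`,
`ascend` leaves the total of all items in every location.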

Problems in the decomposable searching class can be described as follows [1].
Preprocess a set, F, of N objects into a data structure, D, such that certain kinds
of queries about F can be answered quickly. A searching problem is decomposable if the
response to a query, Q, asking the relation of an object, z, to the set, F, can be written
as $Q(z, F) = A_{f \in F}\, q(z, f)$, where f ranges over the elements of F; A is a binary
operator which is associative, commutative, and has an identity; and q is the query asking
the relation of the object z to the element f. The TM described in [1] solves this large
class of searching problems.

We demonstrate in [1] that multiprocessor networks based on binary de Bruijn graphs,
referred to as binary de Bruijn multiprocessors (BDM), and on general de Bruijn graphs,
referred to as de Bruijn multiprocessors (DM), can solve all of the above classes of
problems efficiently. Additionally, it is shown that these multiprocessor networks can
be used as versatile sorting networks.
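
The definition of a decomposable searching problem given above can be made concrete
with a short Python sketch. The helper `decomposable_query` and the two sample queries
(membership, combined with logical OR, and nearest distance, combined with minimum) are
illustrative assumptions, not examples drawn from [1].

```python
import math
from functools import reduce


def decomposable_query(z, objects, q, combine, identity):
    """Answer Q(z, F) by combining q(z, f) over all f in F.

    combine must be associative and commutative with the given identity,
    so the answers for any partition of F can be merged in any order.
    """
    return reduce(combine, (q(z, f) for f in objects), identity)


def member(z, objects):
    """Is z in F?  q is equality, the operator is OR, the identity is False."""
    return decomposable_query(
        z, objects, lambda z, f: z == f, lambda a, b: a or b, False
    )


def nearest_distance(z, objects):
    """Distance from z to its nearest neighbour in F (min, identity +inf)."""
    return decomposable_query(
        z, objects, lambda z, f: abs(z - f), min, math.inf
    )


F = [3, 8, 15, 21]
print(member(8, F), nearest_distance(10, F))
```

Because the operator is associative and commutative, the answer for F equals the
combination of the answers for any two halves of F, which is what allows the set to be
spread over the processors of a tree machine.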

Sorting  is  a  theoretically  interesting  problem  with  a  great  deal  of  practical  significance. 
The sorting problem is defined as follows. We are given N items

$D_1, D_2, D_3, \ldots, D_{N-1}, D_N$

to be sorted; we shall call them data items (or records). Each data item, $D_j$, has a
key, $K_j$, which governs the sorting process. The object of sorting is to determine a
permutation $p(1)\, p(2) \cdots p(N)$ of the data items which puts the keys in
nondecreasing order:

$K_{p(1)} \le K_{p(2)} \le \cdots \le K_{p(N)}$

Usually, either the sorted sequence is output or the $i$th smallest item is placed in the
$i$th processor.
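
A minimal Python sketch of this definition, assuming 0-based indices in place of
1, ..., N and a hypothetical helper name `sorting_permutation`:

```python
def sorting_permutation(keys):
    """Return p such that keys[p[0]] <= keys[p[1]] <= ... (ties keep input order)."""
    return sorted(range(len(keys)), key=lambda j: keys[j])


keys = [42, 7, 19, 7]
p = sorting_permutation(keys)
print(p, [keys[j] for j in p])
```

Placing item `p[i]` in processor `i` then corresponds to the second output convention
mentioned above.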

A  classification  method  for  sorting  architectures  was  presented  by  Winslow  and  Chow. 
The  sorters  have  been  classified  into  the  following  categories: 

(A)  Sequential  Input/Sequential  Output  (SI/SO) 

(B)  Parallel  Input/Sequential  Output  (PI/SO) 

(C)  Parallel  Input/Parallel  Output  (PI/PO) 


(D)  Sequential  Input/Parallel  Output  (SI/PO) 

(E)  Hybrid  Input/Hybrid  Output  (HI/HO)

Note that the classification is based not only on the I/O method, but also on the
interconnection network, the sorting algorithm and the type of keys used.

Our paper [1] demonstrates that the de Bruijn multiprocessor networks can be used to
sort elements in all five categories. An interconnection network that can sort data
items in all of the categories offers a four-fold advantage, because the following
situations can be handled:

1.  Although  it  is  theoretically  possible  to  load  the  data  items  in  parallel,  the  number 
of  ports  available  for  I/O  may  be  limited. 

2.  Even  though  the  I/O  ports  are  available,  it  may  not  be  possible  to  load  the  data 
items,  from  the  secondary  storage,  in  parallel. 

3.  Different  sets  of  data  may  have  different  types  of  keys. 

4.  In  practice,  there  may  be  faults  in  the  network. 

For each of these categories, the time complexity and the size complexity are given in
[1]. The time complexity is the worst-case time required to sort the data items; the size
complexity is the number of data items that can be sorted.

2.2  Research  in  VLSI  Systems 

During this past year, we have addressed the problems of developing methodologies for
the design of VLSI systems that are defect/fault-tolerant and testable, and of evaluating
their cost/performance [2,3]. Today’s complex VLSI systems place additional burdens on
the designer, demanding that designs not only meet specifications but also be testable,
reliable, and capable of being fabricated with reasonable yields. Design methodologies
that enhance both the testability and the yield of VLSI systems are therefore essential.
It is also important to provide the designer with feedback early in the design cycle, so
that correct decisions regarding testability and yield enhancement can be made.
Finally, we must be able to compare the different techniques available for enhancing these
metrics so that, for a given application, a suitable technique can be chosen.

The proposed design methodology for future multi-megabit DRAMs has the following
properties [2]:

•  Easily Testable: Dividing the nodes into modules reduces the size of the test problem
and hence the test time. Testing these nodes in parallel and evaluating the tests
on-chip reduce the test time further. As a result, this architecture yields practical
testing times for multi-megabit RAMs.

•  Low Area Overhead: The additional area required for large RAMs is typically
8-20%. Depending on the total area allowed, the designer can select the appropriate
node granularity, taking into consideration the other tradeoffs involved.

•  Improved Performance: For large RAMs, this architecture can be faster; the potential
reduction in access time is about 30%. Because the performance enhancement also depends
upon node granularity, the designer can tune it by selecting the granularity. In
addition, refreshing the nodes in parallel substantially reduces the time during which
the RAM is unavailable. This reduction in refresh time has to be traded off against the
increase in power dissipation that results from all nodes being refreshed simultaneously.

•  Partitionable and Restructurable: These two properties make it possible to
salvage defective chips as partially good chips. This enhances the effective wafer yield,
making fabrication of very large memory systems economically viable. Because restructuring
involves address remapping, it can be used as part of a highly reliable, self-test,
self-repair system, where the tester can program the address map in real time.
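
The idea of restructuring by address remapping can be sketched in Python as follows.
This is an illustration only: the class `AddressMap`, its methods, and the policy of
packing logical addresses onto the remaining good nodes are assumptions, not the
mechanism described in [2].

```python
class AddressMap:
    """Hypothetical programmable map from logical addresses to good nodes."""

    def __init__(self, num_nodes, words_per_node):
        self.words_per_node = words_per_node
        # Initially every node is assumed good and mapped in physical order.
        self.good_nodes = list(range(num_nodes))

    def mark_defective(self, node):
        """Called by the tester when a node fails; the node is mapped out."""
        if node in self.good_nodes:
            self.good_nodes.remove(node)

    @property
    def capacity(self):
        """Usable words: the chip survives as a partially good chip."""
        return len(self.good_nodes) * self.words_per_node

    def translate(self, logical_address):
        """Return (physical node, offset within node) for a logical address."""
        if not 0 <= logical_address < self.capacity:
            raise IndexError("logical address outside usable capacity")
        logical_node, offset = divmod(logical_address, self.words_per_node)
        return self.good_nodes[logical_node], offset


ram = AddressMap(num_nodes=4, words_per_node=256)
ram.mark_defective(1)
print(ram.capacity, ram.translate(256))
```

After node 1 is mapped out, logical addresses that would have fallen in it are served
by node 2, and the usable capacity drops from four nodes to three.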

The VLSI models developed to evaluate the cost/performance of the TRAM architecture
were also used to evaluate the cost/performance when error control codes are used
to protect DRAMs against soft errors and to enhance defect tolerance. Two codes, the
product code with full code word correction and the odd-weight-column code, were
analyzed for DRAMs. The important advantage of these codes is that they are free
of the high error latency and the long RAM scrubbing time of other proposals. The code
analysis demonstrated that, in spite of an increase in area, the yield is in fact
enhanced. Also, there is only a moderate performance penalty for implementing either the
odd-weight code or the full product code for large RAMs. However, the odd-weight code has
better yield and lower performance cost than the product code [2].

The TRAM architecture can be used as a basic module in a wafer scale memory system,
discussed in detail in the section that follows. The TRAM VLSI models can be used as
benchmarks to evaluate and compare the cost/performance of other defect/fault-tolerant,
testable systems. The methodology can then be used to develop benchmarks for other
architectures. Also, the VLSI modeling technique can be developed into a design tool that
helps researchers and designers estimate the cost/performance of VLSI systems and that
integrates knowledge about testability techniques.

2.3  Research in Fault-Tolerant Multiprocessor Distributed Systems

In distributed systems, our work is strongly dependent on the graph model used to describe
the system. The nodes (communication links) of the system are represented by the vertices
(edges) of a graph. We have enumerated several desirable properties of a system graph
(low internode distances, high connectivity, easy routing, etc.). Most researchers seek
graphs that are best with respect to only one desirable property, but the graphs obtained
are usually of no practical use because they take no account of the other properties. We
addressed this problem in [4]. We also gave a very general method for constructing
graphs, termed colored product construction, which can generate many well-known graphs
(such as the hypercube, cube-connected cycles, and generalized Petersen graphs). We were
able to show that colored product graphs are, indeed, a fertile source of graphs that are
good with respect to every one of the desirable graph properties we have enumerated.

In addition, we have examined other particular graphs, developing a family of
graphs called flip-trees, obtained by interconnecting the leaves of Moore trees. Flip-trees
are regular and have optimal connectivity. If a graph is c-connected, then between every
pair of nodes there exists a set of c node-disjoint paths. We refer to such a set of
node-disjoint paths as a container. The length of a container is dictated by the longest
path in it. The best (shortest) containers of width c are of interest, especially when
c is the connectivity. In [5], we demonstrated that flip-trees have containers of maximal
width, with length one more than twice the Moore bound. Flip-trees are the graphs with
the best-known containers.
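
The notions of container width and length can be illustrated with a small Python check.
The function `container_metrics` is a hypothetical helper; it assumes each path is given
as a list of node labels and does not verify that consecutive nodes are actually joined
by an edge of the graph. The example uses the 3-cube, not a flip-tree.

```python
def container_metrics(paths):
    """Return (width, length) of a container given as a list of node paths.

    Width is the number of paths; length is the number of edges on the
    longest path. Raises ValueError if the paths are not node-disjoint.
    """
    if not paths or any(len(path) < 2 for path in paths):
        raise ValueError("every path needs a source and a target")
    source, target = paths[0][0], paths[0][-1]
    used = set()
    direct_paths = 0
    for path in paths:
        if path[0] != source or path[-1] != target:
            raise ValueError("all paths must join the same pair of nodes")
        interior = path[1:-1]
        if not interior:
            direct_paths += 1
        interior_set = set(interior)
        if (len(interior_set) != len(interior)
                or interior_set & used
                or interior_set & {source, target}):
            raise ValueError("paths are not node-disjoint")
        used |= interior_set
    if direct_paths > 1:
        raise ValueError("the direct link can be used by one path only")
    return len(paths), max(len(path) - 1 for path in paths)


# Three node-disjoint paths between nodes 0 and 7 of the 3-cube.
cube_container = [[0, 1, 3, 7], [0, 2, 6, 7], [0, 4, 5, 7]]
print(container_metrics(cube_container))
```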

A practical perspective on distributed agreement algorithms is given in [5]. There,
we showed that more faults can be accommodated than in prior work if some of the faults
are of a less serious nature. The algorithm given does not require a priori knowledge
of the less severe faults. An interesting aspect of this work is the set of requirements
for establishing communication between any two nodes. Messages are sent in a container so
that each fault may destroy or corrupt only one message. A simple protocol, along with
relaying rules for intermediate nodes, establishes a virtual link between any two nodes.
Separate protocols are given for the case where the receiver knows a message will be sent,
and for the case where the receiver does not know when the next message will be sent. The
protocols achieve communication whenever the faults in the system are few enough to allow
it. We intend to further break down faults according to severity, so as to assess whether
further gains are feasible. We also intend to analyze and compare different methods of
establishing virtual links (various coding schemes, sending multiple messages, etc.).
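
One of the methods mentioned, sending multiple copies of a message over the paths of a
container, can be sketched in Python. This is an illustrative voting rule under stated
assumptions, not the protocol of the cited paper: the function name
`receive_over_container`, the use of `None` for a lost copy, and the acceptance
threshold are all hypothetical. It assumes at most `max_malicious` copies are corrupted
and that lost copies (benign faults) are recognizable as missing.

```python
from collections import Counter


def receive_over_container(copies, max_malicious):
    """Decide on a message from the copies that arrived over disjoint paths.

    copies has one entry per path; None marks a copy that was lost.
    A value is accepted only if more than max_malicious copies agree on it,
    because at least one of those copies must then be genuine.
    Returns the accepted value, or None if no value can be trusted.
    """
    arrived = [c for c in copies if c is not None]
    if not arrived:
        return None
    value, votes = Counter(arrived).most_common(1)[0]
    return value if votes > max_malicious else None


# Width-5 container: one copy corrupted, one copy lost.
print(receive_over_container(["go", "go", "stop", None, "go"], max_malicious=1))
```

Under these assumptions, with a container of width c, b lost copies and m corrupted
copies, the genuine value is accepted whenever c - b - m > m.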

Our previously reported work on system-level diagnosis will appear in [7].
We are interested in evaluating the usefulness of containers in distributed diagnosis. A
probabilistic approach will allow us to ascribe behavior to faulty nodes, instead of
assuming that they either can or cannot find other nodes faulty. An interesting question
is how much is lost if only the tests within a container are performed instead of all
possible tests. Also, a metric is needed to interpret the results consistently.

2.4  Other Research: Designing Buses for Maximum Yield and Minimum Delay

It has been common in recent publications concerned with fault tolerance in VLSI and
WSI to assume that interconnection buses can be designed to be almost defect-free by
enlarging the width of the lines and the spacing between lines. Although this assumption
may often be valid, the cost-effectiveness of this proposed “robust” bus layout is
questionable, particularly in the case of wide buses (e.g., 32 bits wide).
In our paper [8], we derive exact expressions for the yield of an interconnection bus
as a function of its physical dimensions and of the parameters and distribution of the
possible open-circuit and short-circuit defects. We also examine the effect of
introducing redundancy into the bus and obtain the optimal layout of a given bus (with
and without redundancy).

Any change in the layout of a bus may affect the propagation delay of the bus and, as a
consequence, the performance of the VLSI chip. Hence, in addition to its yield, the delay
of the designed bus must be taken into account when determining the final layout of the
bus. Both yield and delay are discussed in [8] through several examples.

3  Publications  and  Students  Supported 

3.1  Publications 

1.  D.K. Pradhan and M.R. Samatham, “The DeBruijn Multiprocessor Network: A
Versatile Parallel Processing and Sorting Network for VLSI,” IEEE Transactions on
Computers, (to appear).

2.  N. Jarwala and D.K. Pradhan, “TRAM: A Design Methodology for Testable
Fault-Tolerant Large RAMs,” IEEE Transactions on Computers, Vol. C-37, Oct. 1988,
U.S. Patent Pending.

3.  N. Jarwala and D.K. Pradhan, “Cost Analysis of On-Chip Error Control Coding for
Fault-Tolerant Dynamic RAMs,” Proc. 17th International Symposium on Fault-Tolerant
Computing, Pittsburgh, PA, July 6-8, 1987, pp. 278-282.

4.  D.K. Pradhan and F.J. Meyer, “Communication Structures in Distributed Systems,”
Proc. 10th Fault-Tolerant Systems and Diagnostics Conf., Varna, Bulgaria, pp. 193-202,
September 1987.

5.  F.J. Meyer and D.K. Pradhan, “Flip-Trees: Fault Tolerant Graphs with Wide
Containers,” IEEE Transactions on Computers, Vol. C-37, No. 4, April 1988, (to appear).

6.  F.J. Meyer and D.K. Pradhan, “Consensus with Dual Failure Modes,” Proc. 17th
Int. Symp. on Fault-Tolerant Comput., Pittsburgh, PA, pp. 48-54, July 1987.

7.  F.J. Meyer and D.K. Pradhan, “Dynamic Testing Strategy for Distributed Systems,”
IEEE Transactions on Computers, (to appear).

8.  I. Koren, Z. Koren and D.K. Pradhan, “Designing Interconnection Buses in VLSI
and WSI for Maximum Yield and Minimum Delay,” IEEE Journal of Solid State
Circuits, May 1988, (to appear).

3.2  Students  Supported 

•  Fred  Meyer,  Ph.D.  Student 

•  Najmi  Jarwala,  Ph.D.  Student 

4  Future  Research 

The following elaborates on tractable future research directions. The following research
is being carried out in the area of defect/fault-tolerant VLSI system design:

•  Studying the problem of designing defect/fault-tolerant, testable VLSI systems,
with an emphasis on DRAMs.

•  Developing an architecture for Multi-Megabit DRAMs that is easily testable,
partitionable, and restructurable.

•  Developing  a  fault  model  and  test  algorithms. 

•  Studying  the  issues  involved  in  developing  a  VLSI  macro-modeling  technique,  so  as 
to  analyze  and  compare  the  cost/performance  of  VLSI  structures.  Developing  VLSI 
models  that  compute  the  cost/performance  tradeoffs  of  the  memory  architecture. 

•  Analyzing  the  yield  potential  of  the  above  architecture. 

•  Studying  the  suitability  of  the  above  architecture  for  Wafer  Scale  Integration. 

•  Exploring the cost/performance tradeoff in using error correcting codes to protect
DRAMs against soft errors by using the modeling technique developed earlier.
Analyzing the effect on yield. Exploring codes other than the commonly used product
codes for this application.
With respect to reliable broadcast, the reliability achieved with the mixed algorithm is
always superior to that of both the benign and the malicious algorithms [6]. A mixed-sum
algorithm was introduced in [6] which achieves the provably maximal reliability (under
the dual failure mode model), but the message complexity of the algorithm is substantial.
Further research should be devoted to finding a more feasible mixed-sum algorithm.

We assumed that the probability of a failure and the probability that a failure is
malicious are independent of the algorithm used. Actually, an electrical fault would
appear malicious to a benign algorithm more often than to a malicious algorithm, because
a benign algorithm is simpler. For purposes of comparison, though, this probability,
$P(m)$, was taken to be the same for each algorithm. Also, the probability of failure has
two components: one due to permanent faults and one due to transient faults. The impact
of permanent faults can be expected to remain largely the same for any algorithm, but the
impact of transient faults is related to the amount of message traffic. This
decomposition of faults has yet to be modelled.

With respect to communication topologies, major advances cannot be made by finding
individual graphs and sequences of graphs; general construction methods are needed that
ensure reasonable graphs with respect to all important parameters. A construction method
has been suggested here that could fill this need, as could other similar methods. Further
efforts should concentrate on such methods. Many good graphs are available via colored
product construction, but these need to be developed:

(1)  A  thorough  enumeration  of  the  useful  colored  product  graphs  is  in  progress. 

(2)  The fault-tolerant diameters of many colored product graphs are known, but general
results establishing the fault-tolerant diameters are needed.

(3)  Of  interest  are  colored  graphs  using  a  palette  with  colors  that  do  not  commute. 

Flip-trees were shown in [5] to be competitive with respect to many aspects of network
topologies, such as diameter and fault-tolerant diameter, and they possess the best-known
containers. The primary areas of deficiency are: (1) traffic congestion and (2) distributed
routing with localized routing information. Further research is underway to address these
deficiencies.

While it is understood that undetectable faults can cause difficulties, the extent of
that problem is not known. Also, it is not clear whether countermeasures (test set
extension, addition of observation points, etc.) are worthwhile. We see this area of
research developing along the following lines:

(1)  provide an analytic function to assess the impact of undetectable faults

It will be necessary to model the impact of undetectable faults that are known as well
as those that are unknown. Known undetectable faults are important because (1)
products may already be deployed in the field when a redundant design is discovered
and (2) considerations such as circuit delay might dictate a redundant design.
Unknown undetectable faults are important because the problem of deciding
whether a fault is detectable is intractable. Knowledge of the location of redundant
leads can be used in two direct ways: (a) test generation can include a check for double
faults involving one of the suspect leads, and also include a search for additional
tests, if necessary; and (b) FRUs with the most undetectable faults are the most
likely source of failure when diagnostics do not isolate the failed FRU.

(2)  extend  analysis  to  predict  reliability  of  chip  at  run  time 

Many faults that occur in the field elude detection. We conjecture that some of these
faults are second faults (with the first fault being an undetectable fault). The first
(undetectable) fault might occur in the field or during manufacture. If field service
encounters such a circumstance, then it may not be able to isolate the FRU. If the
chip has a BIST feature, then its test set may be invalid due to the manufacturing
defect that was not detectable.

(3)  provide  tools  to  predict  the  effectiveness  of  countermeasures 

One key consideration is the accuracy of attempts to locate redundant leads. Without
a good knowledge of the untestable and/or difficult-to-test portions of a circuit, we see no
effective way to prevent test invalidation. One of the more promising countermeasures is
to aid observability by adding observation points. This would be especially frugal for a
BIST methodology, because no extra pins would be needed for the package.